Boosting ridge for the extreme learning machine globally optimised for classification and regression problems

This paper explores the boosting ridge (BR) framework in the extreme learning machine (ELM) community and presents a novel model that trains the base learners as a global ensemble. In ELM single-hidden-layer networks, the nodes in the hidden layer are preconfigured before training, and the optimisation is performed on the weights of the output layer. The previous implementation of the BR ensemble with ELM base learners (BRELM) fixes the nodes in the hidden layer for all the ELMs. That ensemble learning method generates different output layer coefficients by sequentially reducing the residual error of the ensemble as more base learners are added. As in other ensemble methodologies, base learners are added until an ensemble criterion, such as size or performance, is fulfilled. This paper proposes a global learning method in the BR framework, where base learners are not added step by step but are all calculated in a single step that optimises ensemble performance. In this method, (i) the configuration of the hidden layer is different for each base learner, (ii) the base learners are optimised all at once, not sequentially, thus avoiding saturation, and (iii) the ensemble methodology does not suffer from the disadvantage of working with strong classifiers. Various regression and classification benchmark datasets have been selected to compare this method with the original BRELM implementation and other state-of-the-art algorithms. Specifically, 71 classification datasets and 52 regression datasets have been considered, using different metrics and analysing different characteristics of the datasets, such as their size, number of classes and class imbalance. Statistical tests indicate the superiority of the proposed method in both regression and classification problems in all experimental scenarios.

• The optimisation of the weights of the output layer of a Boosting Ridge for Extreme Learning Machine ensemble in a single step instead of iteratively, with the objective of reducing the generalisation error.
• The use of different input layer mappings with different parameters for their hidden layers, made possible by the new optimisation approach and resulting in the so-called Generalised Global BRELM (GGBRELM), which tends to increase the diversity of the ensemble.
• The avoidance of ensemble saturation and overtraining, so that the new proposal works well when the base classifiers become stronger. For example, it is known that increasing the number of neurons in the ELM networks of the base learners makes each one achieve good performance but, in return, reduces the ensemble's performance. The new proposal solves this problem.
• The application of the methodology to more than 120 classification and regression datasets from different domains, which shows that the proposal works better than the state-of-the-art methods compared and is applicable to a wide range of real-world problems.
• The analysis of the performance of the proposed methodology according to different dataset properties, such as size, number of classes or imbalance.
This paper is organised as follows: "State-of-the-art algorithms" Section summarises the notation and formulation of the ELM, BR and BRELM algorithms. "Methodology of the proposal" Section develops the proposed methodology for the globalisation of BRELM and its generalised version GGBRELM, shows a graphical comparison of the methodologies, and includes an analysis of their computational costs. The experimental design is described in "Experimental design" Section, while "Discussion of the results" Section explains the most relevant results, including the statistical analysis. Finally, "Conclusions" Section collects the main conclusions of the work.

State-of-the-art algorithms
This section introduces the notation and formulation of the two algorithms on which this proposal is based, i.e., the ELM predictor and the BR ensemble methodology.
Extreme learning machine. For a simple supervised learning problem, the dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\} = \{(x_n, y_n)\}_{n=1}^{N}$ consists of a set of N patterns, each one with a vector of features, $x_n$, and an associated target, $y_n$.
• $x_n \in \mathbb{R}^K$ is the data information for the n-th pattern, where K is the number of input variables.
• $y_n$ is the target variable for the n-th pattern. In regression problems, $y_n \in \mathbb{R}$ since it is a number. In classification problems with J classes, the target can be expressed with "1-of-J" encoding, $y_n \in \mathbb{R}^J$: each component j of $y_n$ is $y_{j,n} = 1$ if the n-th pattern belongs to class j and $y_{j,n} = 0$ otherwise.
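The "1-of-J" encoding described above can be illustrated with a minimal sketch. The function names `one_of_j` and `predict_class` are hypothetical helpers introduced only for this illustration, and the rule of taking the largest output component as the predicted class follows the ELM description in this section.

```python
def one_of_j(labels):
    """Encode class labels as 1-of-J target vectors (one row per pattern)."""
    classes = sorted(set(labels))                     # fixed class order, J = len(classes)
    index = {c: j for j, c in enumerate(classes)}
    targets = [[1.0 if index[lab] == j else 0.0 for j in range(len(classes))]
               for lab in labels]
    return targets, classes


def predict_class(outputs, classes):
    """Return the class whose output component has the highest value."""
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    return classes[best]


targets, classes = one_of_j(["b", "a", "c", "a"])
```

With this encoding, a J-class problem becomes J regression targets per pattern, which is why the ELM formulation can be presented for regression first.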
Using "1-of-J" encoding, a classification problem can be rewritten as a multi-output regression problem. Thus, the ELM model is explained for regression problems in this subsection, and the explanation for classification is summed up at the end. A predictor $f : \mathbb{R}^K \to \mathbb{R}$ is a function that maps an input pattern $x_n$ to an output target $y_n$, inferred from the relationships in the labelled dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$. In particular, the Extreme Learning Machine (ELM) model builds this function as

$$f(x) = h'(x)\,\beta, \qquad (1)$$

where:
• $h : \mathbb{R}^K \to \mathbb{R}^D$ is a non-linear mapping of the input layer. It transforms the pattern $x_n$ from the original feature space $\mathbb{R}^K$ to the transformed space $\mathbb{R}^D$, where D is the number of neurons in the hidden layer. This mapping is explicitly computed as

$$h(x) = \left(\phi(x; w_1, b_1), \ldots, \phi(x; w_D, b_D)\right)', \qquad (2)$$

with $\phi(\cdot\,; w_d, b_d) : \mathbb{R}^K \to \mathbb{R}$ as the activation function for neuron d, where the weights $w_d$ and biases $b_d$ are randomly generated.
• $\beta \in \mathbb{R}^D$ is the vector of weights in the output layer, found by solving the optimisation problem

$$\min_{\beta \in \mathbb{R}^D} \left( \|H\beta - Y\|^2 + \frac{1}{C}\|\beta\|^2 \right), \qquad (3)$$

where $H = \left(h(x_1), \ldots, h(x_N)\right)' \in \mathbb{R}^{N \times D}$ is the output of the hidden layer for the training patterns, $Y = (y_1, \ldots, y_N)' \in \mathbb{R}^N$ is the vector with the desired targets, and $C > 0$ is a user-specified term that controls the regularisation in the model 12.

Equation (3) represents a convex minimisation problem with error and regularisation terms. The error term $\|H\beta - Y\|^2$ adjusts the coefficient vector $\beta$ in order to minimise the prediction error with respect to $Y$, while the regularisation term $\|\beta\|^2$ is included to avoid over-fitting in the model 43. The optimal solution for the model is the minimum of the convex objective function in Eq. (3), and it is obtained by differentiating and setting the derivative equal to 0:

$$\beta = \left(H'H + \frac{I}{C}\right)^{-1} H'Y. \qquad (4)$$

For a classification problem, there are J minimisation problems like Eq. (3), one per class, with solutions $\beta_1, \ldots, \beta_J$. The predicted class corresponds to the vector component with the highest value, that is,

$$\hat{y}(x) = \arg\max_{j = 1, \ldots, J} h'(x)\,\beta_j. \qquad (5)$$

Boosting ridge regression (linear model). Starting from a linear regression model and its associated ridge minimisation problem, Tutz et al. 32 proposed BR regression as an ensemble learning method that sequentially reduces the residual of the ensemble prediction 42.

Boosting ridge extreme learning machine. The prediction of this sequential ensemble, BRELM, with S base learners is the following linear combination:

$$F(x) = \sum_{s=1}^{S} h'(x)\,\beta^{(s)}.$$

The first base learner, s = 1, is the standard ELM solution from Eq. (3). Later, the training stage of the s-th base learner uses all the data, but its target $\mu^{(s)}$ is the residual of the predictions of the previous base learners,

$$\mu^{(s)} = Y - \sum_{l=1}^{s-1} H\beta^{(l)}.$$

Therefore, the minimisation problem of the s-th base learner is

$$\min_{\beta^{(s)} \in \mathbb{R}^D} \left( \|H\beta^{(s)} - \mu^{(s)}\|^2 + \frac{1}{C}\|\beta^{(s)}\|^2 \right),$$

and the solution for the output layer of the s-th base learner is

$$\beta^{(s)} = \left(H'H + \frac{I}{C}\right)^{-1} H'\mu^{(s)}.$$
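The sequential scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a sigmoid activation, a ridge solution of the form (H'H + I/C)^-1 H'mu for each learner, a shared hidden layer for all learners, and a toy dataset; all helper names (`solve`, `hidden_layer`, `brelm_fit`, ...) are hypothetical.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def hidden_layer(x, weights, biases):
    """Hidden-layer output H (N x D) with fixed random weights (sigmoid assumed)."""
    return [[1.0 / (1.0 + math.exp(-(sum(wk * xk for wk, xk in zip(w, row)) + b)))
             for w, b in zip(weights, biases)] for row in x]


def ridge_output_weights(h, target, c):
    """Assumed ridge solution: beta = (H'H + I/C)^-1 H' target."""
    ht = transpose(h)
    gram = matmul(ht, h)
    for d in range(len(gram)):
        gram[d][d] += 1.0 / c
    return solve(gram, matmul(ht, target))


def brelm_fit(h, y, c, n_learners):
    """Sequential boosting ridge: each learner is fitted to the current residual."""
    residual = [row[:] for row in y]
    betas = []
    for _ in range(n_learners):
        beta = ridge_output_weights(h, residual, c)
        pred = matmul(h, beta)
        residual = [[r - p for r, p in zip(rr, pr)] for rr, pr in zip(residual, pred)]
        betas.append(beta)
    return betas, residual


def sq_norm(m):
    return sum(v * v for row in m for v in row)


# Toy regression problem (illustrative data only).
rng = random.Random(0)
K, D, N = 2, 5, 40
x = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(N)]
y = [[math.sin(3 * a) + b * b] for a, b in x]
weights = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(D)]
biases = [rng.uniform(-1, 1) for _ in range(D)]
h = hidden_layer(x, weights, biases)

_, first_residual = brelm_fit(h, y, c=1.0, n_learners=1)
betas, residual = brelm_fit(h, y, c=1.0, n_learners=5)
```

Because every learner shares the same H, each ridge step can only shrink the training residual, which is the behaviour the sequential formulation relies on; it says nothing by itself about generalisation.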

Methodology of the proposal
In this section, the globalisation of the BRELM is proposed, along with an enhanced version called Generalised Global BRELM (GGBRELM). A graphical comparison of the methodologies is also included. Finally, a theoretical analysis of the computational complexities of the methodologies is discussed. The main hypothesis of this work is that a methodology based on the optimisation of all the base learners in a single step will improve the generalisation error of the ensemble. This procedure is expected to avoid the saturation of the ensemble, so that, for a high number of neurons (strong ELM base learners), the ensemble performance will not be reduced. Besides, the use of different input layer weights and, therefore, different mapping functions ($h^{(s)}$) for the different base predictors will lead to more diversity in the ensemble.
Global boosting ridge for extreme learning machine. The main idea behind BRELM is to reduce sequentially the error produced by the ensemble. This proposal, Global BRELM, presents the problem for each s-th base learner as the error reduction of the other base learners of the ensemble.
Then, ELM must invert a D × D matrix, an operation whose complexity is $O(D^3)$, as shown in 46,47. After that, the product $H'Y$ is computed, that is, a D × N matrix multiplied by an N × J matrix, with a cost of $O(D \cdot N \cdot J)$. Finally, the resulting D × D and D × J matrices are multiplied, with a computational cost of $O(D^2 \cdot J)$. Therefore, the total computational complexity of ELM, denoted O(ELM), is the sum of the costs of these steps.

Methodology flowcharts.
The computational cost for the BRELM and GBRELM methods also depends on the number of base learners S. Since these methodologies train S ELM models sequentially and each model is trained using the residual from the previous one as targets, the computational cost will be O(S · O(ELM) + (S − 1)(N · D · J)).
Finally, considering that GGBRELM performs the optimisation in a single step, the method must invert a DS × DS matrix and multiply the result by a DS × NJ matrix. Given that the matrix $H'H$ is symmetric, only the intermediate blocks $H_s' H_t$ for s = 1, . . . , S and t = s, . . . , S need to be computed. This requires a total of S(S − 1)/2 multiplications of D × N by N × D matrices, resulting in a complexity of $O(S(S-1)/2 \cdot D \cdot N^2)$. For this reason, the computational cost of GGBRELM is $O(S(S-1)/2 \cdot D \cdot N^2 + (DS)^3 + (DS)^2 \cdot J + DS \cdot N \cdot J)$.
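To get a feeling for which term dominates, the GGBRELM cost expression stated above can be evaluated numerically. The sketch below simply plugs values into that expression term by term; the function name `ggbrelm_cost_terms` and the example values of N and J are illustrative assumptions, and the numbers are operation-count proxies rather than measured running times.

```python
def ggbrelm_cost_terms(n, d, s, j):
    """Evaluate each term of the GGBRELM cost expression as stated in the text."""
    return {
        "gram_blocks": s * (s - 1) // 2 * d * n ** 2,  # S(S-1)/2 * D * N^2
        "inversion": (d * s) ** 3,                     # (DS)^3
        "weights_product": (d * s) ** 2 * j,           # (DS)^2 * J
        "target_product": d * s * n * j,               # DS * N * J
    }


# Example: D = 10 and S = 10 as in experiment E1, with a hypothetical N = 100, J = 1.
terms = ggbrelm_cost_terms(n=100, d=10, s=10, j=1)
total = sum(terms.values())
dominant = max(terms, key=terms.get)
```

For these illustrative values the block-product term is the largest; with more neurons or base learners, the (DS)^3 inversion term grows fastest.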

Experimental design
In order to evaluate the methodology presented in "Methodology of the proposal" Section, a comprehensive experimental environment has been implemented. "Experiments" Section describes the experiments performed. "Datasets" Section includes a description of the datasets employed in the regression and classification problems. "Algorithms and parameters setting" Section contains a concise explanation of the algorithms selected for the comparative study and the set-up of their hyperparameters. Finally, the metrics used for the evaluation of the models are detailed in "Measures" Section, and the statistical tests carried out to validate the results are defined in "Statistical tests" Section.
Experiments. As stated before, the aim of this work is not only to improve the performance of the base learner (ELM) but also to overcome the disadvantages of BRELM and, specifically, of Generalised BRELM (GBRELM). Also, for comparison purposes, a recent kernel methodology is used (KBRELM, see "Algorithms and parameters setting" Section). For this purpose, two experiments have been carried out:
• In the first experiment (E1), the number of neurons in the hidden layer is low. With few hidden nodes, ELM is a weak learner and performs worse, whereas the GBRELM ensemble performs better.
• In the second experiment (E2), the number of nodes in the hidden layer is larger. The performance capabilities of ELM are therefore high (strong learners), so this model achieves competitive results. At the same time, the GBRELM ensemble cannot take advantage of its ensemble architecture to improve its performance. As a classic ensemble, its performance increases when weak learners are used and decreases when complex learners are used.
In both experiments, the performance of the methodologies in the datasets will be analysed according to their size. Also, for the classification problems, the number of classes and the imbalance ratio, calculated as the ratio resulting from dividing the number of patterns of the majority class by the number of patterns of the minority class, will be examined. The underlying idea is to demonstrate that GGBRELM outperforms ELM, GBRELM and KBRELM in both experimental scenarios by comparing them in regression and classification problems and performing an analysis according to different dataset properties.
Datasets. Experimental validation has been performed on 71 classification datasets and 52 regression datasets. This selection was made so that the reference datasets include various types of classification and regression problems in terms of their field of application, their size (the number of patterns multiplied by the number of attributes), their number of classes, and their imbalance ratio. Tables 1 and 2 show a summary of the main characteristics of the selected datasets: identification number (ID), which has been assigned by ordering the datasets from the largest to the smallest size, name (Dataset), number of instances (#Inst.), attributes (#Attr.) and size (Size). According to their size, databases have been divided into large (size > 100000), medium (10000 < size < 100000) and small (size < 10000). The number of classes (#Classes), their distribution (Class distribution) and the imbalance ratio (IR) have also been included in the characterisation of the classification datasets (Table 1). Imbalanced datasets (IR > 2) have also been underlined for further analysis. From here to the end, the datasets are referred to by their ID. While the classification datasets are extracted from the UCI Machine Learning Repository 48, the regression benchmark problems come from different machine learning repositories: UCI, the Department of Statistics of the University of Florida 49 and LIACC 50.
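The dataset characterisation above (size categories and imbalance ratio) can be written down as a short sketch. The thresholds are those given in the text; how to treat a size of exactly 10000 or 100000 is not specified, so the boundary handling below is an assumption, and the function names are hypothetical.

```python
def size_category(n_instances, n_attributes):
    """Classify a dataset by size = instances * attributes."""
    size = n_instances * n_attributes
    if size > 100_000:
        return "large"
    if size > 10_000:          # assumption: exactly 100000 counts as medium
        return "medium"
    return "small"             # assumption: exactly 10000 counts as small


def imbalance_ratio(class_counts):
    """IR = patterns in the majority class / patterns in the minority class."""
    return max(class_counts) / min(class_counts)


def is_imbalanced(class_counts, threshold=2.0):
    """A dataset is treated as imbalanced when IR > 2, the threshold used in this work."""
    return imbalance_ratio(class_counts) > threshold
```

For instance, a hypothetical binary dataset with 90 and 30 patterns per class has IR = 3 and would be underlined as imbalanced.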
Algorithms and parameters setting. The proposed method has been evaluated by comparing its results with those of other recent state-of-the-art ELM proposals. The comparison methods are briefly described below:
• Extreme Learning Machine (ELM) 12 (described in "Extreme learning machine" Section). In the model implementation, the weights and biases in the hidden layer were randomly generated following a uniform distribution, whereas the output weights were optimised using the ELM minimisation problem with L2 regularisation.
• Generalised BRELM (GBRELM) (a version combining the algorithm described in "Boosting ridge extreme learning machine" Section with the generalisation of the mapping functions $h^{(s)}$). This work compares against the generalised version of Boosting Ridge for Extreme Learning Machine because it introduces variability into the model. It would not make sense to compare with a simpler version where all ensemble elements have the same input layer.
• Generalised Global BRELM (GGBRELM) (described in "Methodology of the proposal" Section). The proposed methodology improves the original sequential Generalised Boosting Ridge architecture with a global approach.
• Kernel BRELM (KBRELM) 39. In order to compare our proposal with a more recent methodology in the literature, we have also added a Boosting Ridge ensemble that uses Kernel Ridge Regression as base learners, as in 39. This method works like the sequential Boosting Ridge for ELM presented in "Boosting ridge regression" Section but uses the kernel trick instead of a neural mapping. For this purpose, a Gaussian kernel with hyperparameter γ was used.
The performance of the comparison methods depends critically on the setting of two hyperparameters: the regularisation parameter, C, and the number of hidden nodes, D. The hyperparameter C was determined by a grid search in a 5-fold nested cross-validation.
The optimal value of the regularisation parameter for all comparison methods was determined with the following grid: $C \in \{10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$. The number of hidden nodes, D, in all models was set to D = 10 for the first experiment and D = 1000 for the second one. In the case of the KBRELM method, the γ parameter also needs to be cross-validated, so it has been determined with the grid $\gamma \in \{10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$. The ensemble size for all the ensemble methods was set to 10 base learners. The experimental results were obtained using a 10-fold cross-validation procedure, with 3 repetitions per fold. Thus, 30 error measures were obtained for each of the methods compared, providing a sufficient sample for the statistical analysis of the results. The partitions were the same for all models compared. Input values were standardised, regression labels were scaled to [0, 1] and class labels were binarised following "1-of-J" encoding 51.
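The evaluation protocol above can be sketched as a plan of 30 runs per method. This is a hedged illustration under one reading of "3 repetitions per fold", namely that each of the 10 partitions is reused three times with a different random seed for the model; the names `C_GRID`, `kfold_indices` and `evaluation_plan` are hypothetical and the inner 5-fold grid search is not shown.

```python
import random

# Grid for the regularisation parameter C (and for gamma in KBRELM).
C_GRID = [10.0 ** e for e in range(-2, 3)]


def kfold_indices(n_patterns, n_folds, seed):
    """Shuffle pattern indices once and split them into (train, test) folds."""
    indices = list(range(n_patterns))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits


def evaluation_plan(n_patterns, n_folds=10, repetitions=3, seed=0):
    """List every (fold, repetition) run; the same partitions serve all methods."""
    plan = []
    for fold, (train, test) in enumerate(kfold_indices(n_patterns, n_folds, seed)):
        for repetition in range(repetitions):
            plan.append({"fold": fold, "repetition": repetition,
                         "train": train, "test": test})
    return plan


plan = evaluation_plan(100)
```

Fixing the partition seed is what keeps the folds identical across the compared models, so that the 30 error measures are paired.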
Measures. The metrics used for performance validation are well-known, standard metrics for classification and regression problems. The simplicity of the accuracy rate (Acc) has made it a widely used performance measure for classification problems. However, Acc is unsuitable for imbalanced datasets, which is one of its main drawbacks. As seen in Table 1, there are a total of 35 datasets with an IR higher than 2, which is the threshold value considered in this work. Therefore, it is more appropriate to use Balanced Accuracy, which is equal to the accuracy in balanced datasets and takes the class imbalance into account when it exists. In addition, two other classification metrics, Precision and F-measure (F1), have also been used because they are useful in both balanced and imbalanced scenarios.
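Given a binary classification problem (positive and negative patterns), the following quantities are considered:
• True positives (TP): positive patterns predicted as positive.
• True negatives (TN): negative patterns predicted as negative.
• False positives (FP): negative patterns predicted as positive.
• False negatives (FN): positive patterns predicted as negative.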
Then, these classification performance metrics are mathematically defined as follows:
• Balanced Accuracy is the mean of Sensitivity and Specificity, which makes it suitable for imbalanced datasets. If a model only predicts accurately for the majority class in the dataset, it will receive a lower Balanced Accuracy score:

$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

• Precision is the proportion of patterns predicted as positive that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

• F1 is the harmonic mean of Precision and Recall (Sensitivity):

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

For multi-class problems, the metrics are calculated by comparing one class against all the others. The chosen class is considered positive, while the others are negative. This approach yields a metric value for each of the classes, and the mean value is then reported.
The root mean square error (RMSE) and the determination coefficient ($R^2$) are the principal measures in the validation of an algorithm for regression problems:
• RMSE is the standard deviation of the differences between predicted and target values, and it is defined as

$$RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}(x_n)\right)^2},$$

where $\hat{y}(x_n)$ is the predicted value for pattern $x_n$, and $y_n$ is the real one.
• $R^2$ is the determination coefficient, representing the proportion of the variation in the dependent variable that is predictable from the independent variables. It is computed from the real values, $y$, and the predicted values, $\hat{y}$.
Statistical tests. In order to show that the GGBRELM model is a promising method in its field, it is crucial to validate its performance against that of the comparison methods with statistical tests. For both experiments and for each metric, a pre-hoc test was applied to the evaluations of the methods on the different datasets to assess the statistical significance of the rank differences. For evaluations where the test detected statistical differences in the method rankings, a post-hoc test was conducted, using the best performing method as the control method, to determine which models differ significantly among the multiple comparisons performed. For this purpose, nonparametric tests were applied. First, nonparametric Friedman's tests 52, with the rankings of the models in Balanced Accuracy, Precision and F1 (classification), and RMSE and $R^2$ (regression) as test variables, were carried out for α = 0.05. Then, the nonparametric Holm's post-hoc test 53 was applied to determine whether the control method, GGBRELM, statistically outperforms the comparison methods, considering α = 0.05 and taking into account each metric.
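The measures defined in the "Measures" subsection can be computed as in the following minimal sketch. It assumes the standard definitions of Sensitivity, Specificity, Precision and F1 in terms of TP, TN, FP and FN, uses the one-vs-rest averaging described for multi-class problems, and returns 0 when a denominator is zero, which is a convention chosen here and not stated in the text. Function names are hypothetical.

```python
import math


def binary_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def safe_div(num, den):
    return num / den if den else 0.0  # assumed convention for empty denominators


def classification_metrics(y_true, y_pred):
    """One-vs-rest Balanced Accuracy, Precision and F1, averaged over classes."""
    classes = sorted(set(y_true))
    ba, prec, f1 = [], [], []
    for c in classes:
        tp, tn, fp, fn = binary_counts(y_true, y_pred, c)
        sensitivity = safe_div(tp, tp + fn)   # also called Recall
        specificity = safe_div(tn, tn + fp)
        precision = safe_div(tp, tp + fp)
        ba.append((sensitivity + specificity) / 2)
        prec.append(precision)
        f1.append(safe_div(2 * precision * sensitivity, precision + sensitivity))
    n = len(classes)
    return {"balanced_accuracy": sum(ba) / n,
            "precision": sum(prec) / n,
            "f1": sum(f1) / n}


def rmse(y_true, y_pred):
    """Root mean square error between real and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


scores = classification_metrics([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

In this toy example the plain accuracy is 4/6, while the class-averaged Balanced Accuracy is 0.625, showing how errors on the minority class weigh more heavily.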

Discussion of the results
This section includes the analysis of the experimental results obtained on the selected datasets. This part of the paper has been divided into two sections, one for classification and one for regression datasets. For the sake of conciseness, only the relevant graphs and a summary of the statistical results are provided. In the figures for classification datasets (Figs. 2 and 3), the Y-axis represents the value of the reported metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for one dataset, its ID appears in bold, and if it is the second best, it appears in italics. Finally, imbalanced datasets are marked with an underline. For all the classification metrics, the higher the point is located on the graph, the better the performance of that method, since the objective is to maximise these metrics. As a general rule, it can be observed that the GGBRELM methodology outperforms the other approaches in Balanced Accuracy, Precision and F1 in both experiments. Notably, the difference is greater in those datasets where the methodologies in general do not achieve good performance.
In particular, in E1, when comparing Balanced Accuracy , GGBRELM performs better in 31 datasets, and it is the second best in 36, representing almost the total number of databases. For precision, it is the best in 36 datasets and the second one in 30. Moreover, for the F1, GGBRELM is also the best in 36 datasets and the second in 27. GBRELM and KBRELM have similar performance regarding the number of databases in which they are the best or second. ELM performance is lower than the ensemble approaches, according to the literature.
Furthermore, in experiment E2, where the classifiers are configured with a high number of neurons in the hidden layer, the ELM becomes more specialised. Hence its performance improves, and it would be expected to outperform the ensemble methods, given their disadvantages when using strong base learners, such as saturation or overfitting. Nevertheless, while it is true that GBRELM and KBRELM obtain worse results than ELM, GGBRELM overcomes this disadvantage of ensemble methods and obtains more accurate results. Thus, GGBRELM achieves the best result in 27, 30 and 28 datasets in terms of Balanced Accuracy, Precision and F1, respectively, and the second best in 31, 30 and 30 datasets. The proposed methodology is therefore also better than the three compared methods, as shown in Fig. 3.
As mentioned above, a set of statistical tests has been carried out to analyse the results through statistical hypothesis testing, and the results are summarised in Table 3. For Friedman's tests and a level of significance α = 5%, the confidence interval is $C_0 = (0, F_{0.05} = 2.65)$. In experiment E1 (D = 10), the F-distribution statistical value is $F^* = 27.80$ for the Balanced Accuracy rankings, $F^* = 31.69$ for the Precision rankings and $F^* = 22.73$ for the F1 rankings, while in experiment E2 (D = 1000), $F^* = 15$, $F^* = 10.76$ and $F^* = 9.89$, respectively. Consequently, in both experiments, the test rejects the null hypothesis stating that all algorithms perform equally in mean ranking of Balanced Accuracy, Precision and F1. That is, the algorithm effect is statistically significant. For this reason, the best performing method is taken as the control method for a post-hoc test and is compared with the rest of the methods. Table 3 also shows the results of Holm's test. When using GGBRELM as the control algorithm (CA), Holm's test shows that $p_i < \alpha^*_i$ in all cases, for α = 0.05, confirming that there are statistically significant differences favouring GGBRELM in both experiments and for each metric.
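The two-stage testing procedure can be sketched as follows. This is an illustration under stated assumptions: the F-distributed statistic is computed with the Iman-Davenport correction of Friedman's chi-square, which is consistent with reporting an F critical value but is not named in the text, and Holm's corrected levels are taken as α/(k − i + 1) for the i-th smallest p value. The average ranks and p values used below are made up for the example and are not results from this study.

```python
def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic from the average ranks of k methods."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def iman_davenport_f(avg_ranks, n_datasets):
    """F-distributed statistic with (k-1) and (k-1)(N-1) degrees of freedom (assumed form)."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, n_datasets)
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


def holm(p_values, alpha=0.05):
    """Holm step-down: reject while p_i < alpha / (k - i + 1), in ascending p order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    rejected = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (k - rank):
            rejected[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p values are too
    return rejected


# Hypothetical example: 4 methods ranked on 10 datasets.
f_stat = iman_davenport_f([1.5, 2.0, 3.0, 3.5], n_datasets=10)
all_rejected = holm([0.001, 0.04, 0.02])
partly_rejected = holm([0.01, 0.04, 0.03])
```

The null hypothesis of equal rankings is rejected when the F statistic falls outside the interval bounded by the critical value, and only then is Holm's procedure applied against the control method.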
Discussion considering dataset size. As aforementioned, the datasets have been sorted in decreasing order of size and have been divided into three categories according to it, as shown in Table 1: 17 large datasets (IDs 1-17), 25 medium (IDs 18-42) and 29 small ones (IDs 43-71).
Looking at E1, for large datasets, GGBRELM is the best in 8 datasets and the second in the remaining ones for all metrics. It is the best in 12, 13 and 13 medium datasets and the second in 11, 10 and 9 according to Balanced Accuracy , Precision and F1, respectively. For small datasets, the best results are achieved on 11, 15 and 15, and the second best on 16, 11 and 9 datasets, depending on the metric analysed.
For the case of E2, for large datasets, GGBRELM is the best in 11, 10 and 9 and the second best in 4, 6 and 7. For medium datasets, the best are obtained in 6, 10 and 9, while the second best results are achieved in 14, 11 and 10. Finally, the best results are obtained in 10 and the second best in 13 small datasets in all metrics.
As can be seen, regardless of size, the GGBRELM method performs quite well. However, for both E1 and E2, the best results are concentrated in the large datasets, where it is the best or second best method in almost all metrics in both experiments. In the smallest datasets, the improvement of the proposal is not as noticeable as in the remaining ones. This makes sense, since they are databases without difficulty and are easier to solve by any method.
Figure 2. Performance plot on metrics for classification datasets using D = 10. The Y-axis represents the value of the metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for that dataset, its ID appears in bold, and if it is the second best, it appears in italics. Finally, imbalanced datasets are marked with an underline.
Figure 3. Performance plot on metrics for classification datasets using D = 1000. The Y-axis represents the value of the metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for that dataset, its ID appears in bold, and if it is the second best, it appears in italics. Finally, imbalanced datasets are marked with an underline.
Discussion considering imbalanced datasets. In the experimental validation, there are a total of 35 imbalanced datasets. As stated, for each classification database, the IR has been calculated as the ratio of the number of patterns in the majority class to the number of patterns in the minority class. The IR has been reported in Table 1, underlining those datasets with an IR > 2. In addition, in Figs. 2 and 3, the IDs of these imbalanced datasets have also been underlined, making it easier to discuss the results by taking them into account.
Considering the first experiment, with D set to 10, GGBRELM achieves the best result on 13 datasets and the second best on 18 in terms of Balanced Accuracy, which together account for almost all of the imbalanced databases. The other two metrics behave similarly: GGBRELM is the best in 15 and the second best in 15 for Precision, and the best in 16 and the second best in 11 for F1. In this case, it is worth noting that the second-best method on average for the three metrics is GBRELM. Although KBRELM obtains the best result in many databases, it shows unstable behaviour, since it is either very good or the worst, depending on the dataset.
As for E2, the same happens for GGBRELM, being the best method for the three metrics in 9, 13 and 12 datasets, respectively, and the second best method in 18, 16 and 13. It is important to note that for imbalanced datasets, the GBRELM method has approximately the same average performance in all metrics with respect to ELM, but ELM is still slightly better than GBRELM.
From this analysis, it can be concluded that the proposed GGBRELM method not only performs well on all metrics for all databases but is also the most appropriate for imbalanced datasets.
Discussion considering the number of classes. From column #Classes in Table 1 and Figs. 2 and 3, the influence of the number of classes on the results obtained can be analysed.
Thus, for E1 and the 44 binary problems, GGBRELM is the best algorithm on average since it is the best on 26, 27 and 28 databases depending on the analysed metric ( Balanced Accuracy , Precision and F1). In addition, it is the second best on 16, 14 and 11, respectively. In the case of multiclass problems, and specifically as the number of classes increases, KBRELM performs similarly to GGBRELM in this experiment. This can be explained by the fact that the higher the number of classes, the more complex the problem becomes, and the algorithms with a higher number of connections benefit, as is the case of kernels.
However, for the case of E2, i.e., when GGBRELM is provided with more neurons in its base classifiers, the results indicate that it performs better on average than the rest of the algorithms in binary and multiclass problems in all metrics. Thus, in binary problems, GGBRELM is the best in 20, 22 and 21, and the second in 14, 13 and 13, respectively. For the case of problems with a more significant number of classes, it is the best in 7, 8 and 7 and the second best in practically the remaining ones, making it the best algorithm on average.

Regression datasets.
The performances of the considered methods for E1 (D = 10) and E2 (D = 1000) in regression datasets are shown in Figs. 4 and 5, respectively ((a) RMSE, (b) $R^2$). As in classification datasets, the Y-axis represents the value of the reported metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for one dataset, its ID appears in bold, and if it is the second best, it appears in italics. For the RMSE metric, the lower the point is located on the graph, the better the performance of that method, since the objective is to minimise this metric. The opposite occurs for the $R^2$ metric because it must be maximised. The results show that the GGBRELM methodology outperforms the alternative approaches in both experiments and across both metrics. This distinction is especially evident in datasets where the other methodologies exhibit suboptimal performance.
Table 3. Results of the Friedman's and Holm's tests using GGBRELM as control algorithm (CA) when comparing its average Balanced Accuracy, Precision and F1 to those of ELM, GBRELM and KBRELM: corrected α values, compared methods and p values, all of them ordered by the number of comparison (i). CA results statistically better than the compared algorithm are marked with (*).
Thus, in the case of E1, GGBRELM is the best method in 44 datasets and the second best in 4 datasets in terms of RMSE. In addition, it is the best method in 43 datasets and the second best in 5 datasets when comparing $R^2$. With a low number of neurons, GBRELM also outperforms ELM, since the latter is a weak learner. However, KBRELM does not seem to perform well in problems of this nature, being the worst regressor of the four methods.
Figure 4. Performance plot on metrics for regression datasets using D = 10. The Y-axis represents the value of the metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for that dataset, its ID appears in bold, and if it is the second best, it appears in italics.
Furthermore, in experiment E2, GGBRELM overcomes the disadvantage of ensemble methods by obtaining more accurate results regarding RMSE and $R^2$. Hence, GGBRELM achieves the best RMSE performance in 34 datasets and the second best in 14. Similarly, it gets the best $R^2$ in 28 datasets and the second best in 19.

In the same way as in classification datasets, four Friedman's tests have been run, showing the rejection of the null hypothesis given that, for α = 5%, the confidence interval is $C_0 = (0, F_{0.05} = 2.66)$, and the statistical values for RMSE and $R^2$ are $F^* = 102.63$ and $F^* = 101.97$ in E1, and $F^* = 77.21$ and $F^* = 91.05$ in E2 (Table 4). This table also shows the results of Holm's test comparing RMSE and $R^2$. Again, when using GGBRELM as the control algorithm (CA), Holm's test shows that $p_i < \alpha^*_i$ in all cases, for α = 0.05, confirming that there are statistically significant differences favouring GGBRELM in both experiments and metrics.
Figure 5. Performance plot on metrics for regression datasets using D = 1000. The Y-axis represents the value of the metric, while the X-axis contains the IDs of the datasets sorted by size. If GGBRELM is the best for that dataset, its ID appears in bold, and if it is the second best, it appears in italics.
Considering E1, for large datasets, GGBRELM is the best in all datasets for all metrics. For medium-sized datasets, it is the best in 22 in both metrics and the second in 2 and 3, respectively. For small datasets, the best results are achieved on 15 and 14, and the second best on 2 datasets in both metrics.
In the case of E2, for large datasets, GGBRELM is the best in 6 datasets and the second best in 1 for both metrics. For medium-sized datasets, the best results are obtained in 19 and 11 datasets (RMSE and R², respectively), while the second best results are obtained in 5 and 11. Finally, for small datasets, the best results are obtained in 9 and 11 datasets, and the second best in 8 and 7.
In both experiments, the dataset size does not influence the outcome since, in all cases, the GGBRELM algorithm is much better than the others. However, in the five smallest datasets, the performance gap between GGBRELM and the other methods narrows, since these datasets lack complexity and can be solved well by any method.
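The statistical comparison used in this section (a Friedman test followed by Holm's procedure against a control algorithm) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the F-distributed (Iman-Davenport) form of the Friedman statistic, which the critical value F_0.05 suggests, and two-sided normal p-values for Holm's test. All function names and the toy RMSE values are hypothetical.

```python
import math


def average_ranks(results, higher_is_better=False):
    """Mean rank per method (1 = best); tied values share the average rank."""
    n_methods = len(results[0])
    totals = [0.0] * n_methods
    for row in results:
        keyed = [-v if higher_is_better else v for v in row]
        for j, v in enumerate(keyed):
            smaller = sum(1 for u in keyed if u < v)
            equal = sum(1 for u in keyed if u == v)  # includes v itself
            totals[j] += smaller + (equal + 1) / 2.0
    return [t / len(results) for t in totals]


def iman_davenport(ranks, n_datasets):
    """F-distributed Friedman statistic (assumed form, see lead-in).

    Undefined when all datasets rank the methods identically, because the
    denominator becomes zero in that case.
    """
    k = len(ranks)
    chi2 = 12.0 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


def holm(ranks, control, n_datasets, alpha=0.05):
    """Holm step-down test of every method against the control method.

    Returns tuples (method index, p-value, corrected alpha, rejected),
    ordered by increasing p-value.
    """
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    pvals = []
    for j, r in enumerate(ranks):
        if j != control:
            z = (r - ranks[control]) / se
            pvals.append((math.erfc(abs(z) / math.sqrt(2.0)), j))  # two-sided
    pvals.sort()
    out, rejecting = [], True
    for i, (p, j) in enumerate(pvals):
        corrected = alpha / (k - 1 - i)
        rejecting = rejecting and p < corrected  # stop at the first failure
        out.append((j, p, corrected, rejecting))
    return out


# Toy RMSE values invented for illustration (rows: datasets; columns:
# ELM, GBRELM, KBRELM, GGBRELM). They are not results from the paper.
toy = [
    [0.30, 0.28, 0.35, 0.25],
    [0.20, 0.22, 0.40, 0.21],
    [0.50, 0.45, 0.60, 0.44],
    [0.10, 0.12, 0.15, 0.09],
    [0.33, 0.31, 0.30, 0.29],
    [0.27, 0.26, 0.29, 0.24],
]
ranks = average_ranks(toy)
f_stat = iman_davenport(ranks, len(toy))
holm_rows = holm(ranks, control=3, n_datasets=len(toy))
print(ranks, f_stat, holm_rows)
```

The null hypothesis is rejected when the statistic falls outside the interval (0, F_0.05); the critical value itself comes from an F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom and is not computed in this sketch.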

Conclusions
This paper presents a new ensemble methodology that tackles the saturation of base learners and the drop in performance that occurs when strong base learners are used in the ensemble, while avoiding the need to increase the size of the ensemble iteratively. To this end, the method performs a global optimisation in the Boosting Ridge methodology, using Extreme Learning Machine models as base learners. The proposed ensemble method, Generalised Global Boosting Ridge for Extreme Learning Machine, generates a set of initial input layer mappings with different parameters for their hidden layers. The output layer weights are optimised in one step, reducing the generalisation error of the ensemble.
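The idea of optimising all output layers in one step can be illustrated with a short sketch. It is not the GGBRELM formulation of the paper: it assumes that the ensemble output is the sum of the base learners' outputs, that each base learner has its own random sigmoid hidden layer, and that a single ridge penalty `reg` is shared by all output weights. Under those assumptions, the joint problem reduces to one ridge regression over the concatenated hidden-layer outputs. All names and parameter values are hypothetical.

```python
import numpy as np


def hidden_map(X, W, b):
    """Sigmoid hidden-layer output of one ELM with fixed random parameters."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def fit_global_ridge(X, Y, n_learners=5, n_hidden=10, reg=1e-2, seed=0):
    """Fit the output weights of all base learners in a single ridge solve.

    Assumption (not from the paper): the ensemble prediction is the sum of
    the base learners' outputs, and one penalty `reg` applies to all weights.
    """
    rng = np.random.default_rng(seed)
    # A different random hidden layer for every base learner.
    params = [(rng.uniform(-1, 1, (X.shape[1], n_hidden)),
               rng.uniform(-1, 1, n_hidden)) for _ in range(n_learners)]
    # Concatenate the hidden-layer outputs column-wise.
    H = np.hstack([hidden_map(X, W, b) for W, b in params])
    # Closed-form ridge solution for all output weights at once.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ Y)
    # Split the joint solution back into one weight block per base learner.
    betas = np.split(beta, n_learners, axis=0)
    return params, betas


def predict(X, params, betas):
    """Ensemble prediction as the sum of the base learners' outputs."""
    return sum(hidden_map(X, W, b) @ beta for (W, b), beta in zip(params, betas))


# Synthetic regression data, invented for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, (200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

params, betas = fit_global_ridge(X, y)
y_hat = predict(X, params, betas)
rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print(rmse)
```

In contrast with a sequential boosting loop, no base learner here is fitted to the residual of the previous ones; every weight block is chosen with respect to the ensemble error as a whole.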
An extensive experimental study has been carried out, covering 71 classification datasets, analysed by size, number of classes and imbalance ratio, and 52 regression datasets, analysed by size, all from different application domains. The experiments show that (i) the proposed Generalised Global ensemble method for ELM outperforms Generalised Boosting Ridge in different contexts, that is, with both a low and a high number of neurons, and (ii) the Generalised Global methodology improves the results of ELM when it is specialised with a high number of neurons, overcoming the disadvantage of ensemble methods in these scenarios. Instead of generating diversity through weak learners (low number of neurons), our method relies on optimising the final prediction of the ensemble as a whole, and therefore does not depend on the implicit diversity of the hidden-neuron mappings.
In future work, it is planned to adapt the ensemble learning framework to other base learners and other machine learning paradigms, such as ordinal regression or semi-supervised learning. Finally, the methodology could be applied to real-world problems.

Data availability
The databases used, together with the code necessary for their extraction, are available at https://github.com/cperales/uci-download-process. The code generated in the experimental design, including the proposed methodology, is available at https://github.com/cperales/pyridge. The complete result tables obtained during the current study are available from the corresponding author upon reasonable request.
Table 4. Results of the Friedman's and Holm's tests using GGBRELM as control algorithm (CA) when comparing its average RMSE and R² to those of ELM, GBRELM and KBRELM: corrected α values, compared methods and p-values, all of them ordered by the number of comparison (i). CA results statistically better than the compared algorithm are marked with (*).